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k*" ■ Abstract 

o 

We introduce a Bayesian extension of the latent block model for model-based block 
clustering of data matrices. Our approach considers a block model where block pa- 
rameters may be integrated out. The result is a posterior defined over the number 
of clusters in rows and columns and cluster memberships. The number of row and 
■ column clusters need not be known in advance as these are sampled along with cluster 

. memberhips using Markov chain Monte Carlo. This differs from existing work on la- 

tent block models, where the number of clusters is assumed known or is chosen using 
some information criteria. We analyze both simulated and real data to validate the 
technique. 
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o\ ■ 1 Introduction 

(N , 

Many data sets arise as a result of a number of features or variables being observed for a 
collection of objects. As examples; shoppers and the items which they do or do not buy; 
whether a document contains specific words or not; the expression levels of a gene under a 
series of conditions in a DNA experiment. Such data will be recorded in a matrix, say, with 
rows indexing objects and columns indexing features or variables. Often interest will focus on 
clustering rows and further, clustering the features which distinguish these row clusters. We 
refer to this task as block clustering although it is also known as block modelling, biclustering, 
co-clustering and two-mode clustering. 

One of the first approaches to block clustering was suggested by Hartigan (1972) and 
since then, many have been proposed. Much recent work in block clustering and related 
areas has been either on the analysis of microarray data (Tibshirani et al. 1999, Cheng & 
Church 2000, Getz et al. 2000, Lazzeroni & Owen 2002, Kluger et al. 2003) or document 
classification (Hofmann 2001, Blei et al. 2003, Griffiths & Steyvers 2004). Approaches vary 
in whether they allow clusters to overlap or not. In our case, the problem can be thought of 
as permuting the rows and columns of the data matrix to make a "chessboard" of blocks of 
data having similar value. 

While many approaches to document classification are model-based i.e. a parametric 
underlying model is assumed when clustering data, this is often not the case in microarray 
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analysis. Some exceptions are Lazzeroni & Owen (2002) who assume a Gaussian error model 
for gene expression with additive effects for gene and condition clusters and Sheng et al. 
(2003) who assume a multinomial model for expression level in a discretized microarray. It 
is common to use two-way hierarchical clustering for this data or other partitioning methods 
(for example Getz et al. (2000)). One drawback of these methods is the lack of probabilistic 
justification as noted by Wit & McClure (2004) (Chapter 7, page 171). A model-based 
approach allows explicit modelling of noise in the data. This can be an advantage in data, 
such as microarrays, which is particularly prone to noise, incorporating uncertainty in cluster 
membership. 

In this paper we consider an extension of the latent block model (LBM) approach of Go- 
vaert & Nadif (2008). The LBM was developed as an intuitive extension of the finite mixture 
model used in model-based clustering (Fraley & Raftery 2002) to allow clustering of objects 
and features. We propose a Bayesian LBM. This has been considered previously by van Dijk 
et al. (2009). In their approach the number of clusters in objects and features is assumed 
known and Gibbs sampling is used to find clusterings. They choose the number of clusters 
using an information criterion based on maximum likelihood. 

We show that it is possible to sample the number of clusters and the cluster membership 
jointly using simple Markov chain Monte Carlo (MCMC) on a collapsed model, so that 
uncertainty in the number of clusters is naturally incorporated as part of our Bayesian LBM. 
The collapsed model is obtained by integrating out block parameters analytically. This is 
possible using standard prior assumptions. There is no need to resort to a trans-dimensional 
sampler, such as the reversible jump sampler of Green (1995). Our idea extends the allocation 
sampler of Nobile & Fearnside (2007) to two directions, with slight modifications. We discuss 
differences and similarities of our approach to block clustering with those which are most 
comparable qualitatively. The sampler is applied to both simulated and real datasets to 
gauge performance. 

The remainder of the paper is organized as follows. Section 2 reviews the LBM and 
introduces the collapsed Bayesian LBM. Section 3 gives the MCMC sampler which we use. 
The differences between our sampler and reversible jump samplers are also discussed. Label 
switching is mentioned and the section concludes with approaches to summarize the output 
of the sampler. Section 4 applies the approach to simulated data. In Section 5 we analyze 
voting records data from the U.S. congress and compare this to a maximum likelihood 
analysis. In Section 6 we discuss the analysis of microarray data, and use our sampler to 
analyze data from a yeast microarray experiment. The paper concludes with a discussion. 

2 Models 

The data is Y = (yij), an n x m matrix. It is assumed rows and columns may be reordered 
so that the matrix can be represented as K x G blocks with data in blocks modelled by 
the same density, where K and G are the number of row and column clusters respectively. 
This could be imagined as a "chessboard" effect, with K — 1 divisions in the direction of 
the rows and G — 1 in the direction of the columns. The parameters of the data density are 
conditional on the block and 9k g denotes the parameters for block (k,g), with O denoting 
the collection of these. We now give a review of the LBM of Govaert & Nadif (2008). 
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2.1 Latent block models 

Conditional on K and G, let U be a latent space indexing the set of all possible clusterings 
of rows and columns. Then the distribution of the data Y can be written 

p(Y\K, G, 6, 0) = £ p(u\K, G, (j>)p(Y\K, G, u, 6) 

where 4> are parameters for the distribution of u. Govaert & Nadif (2008) make the as- 
sumption that row and column clusterings are independent a priori, so that p(u\K, G, <f>) = 
p(z\K, u:)p(w\G, p) where Zi = k if row i is in cluster k and Wj = g if column j is in cluster 
g. The probability of a row belonging to cluster k is uik and p g denotes the probability that 
a column belongs to cluster g. The LBM is then 

p(Y\K,G,e,u,p) = £ p(z\K,u>)p(w\G,p)p(Y\K,G,z,w,Q) (1) 

(z,w)e2xW 

where Z and W denote the latent spaces of all row and column clusterings respectively. 
When constructing the data likelihood given the latent allocations, we make the assumption 
of local independence. That is, within a block, data are independent. This gives data 
likelihood conditional on z, w, 

P (yi^,G,z,w,e) = nn n n pwo- 

fc=l g=l i:Zi=k j-Wj=g 

As \Z x W| = K n G m , it is not feasible to calculate ([T]). We now review a way to fit this 
model using a method based on Expectation-Maximization (EM) (Dempster et al. 1977) due 
to Govaert & Nadif (2008). 

2.1.1 Estimation using BEM2 

Here we outline the BEM2 algorithm of Govaert & Nadif (2008) which we will compare our 
approach with later (Section [5]). In the interests of avoiding ambiguities in notation, we 
introduce the matrices r = (r^) and c = (cj g ) such that if row i is in cluster k, = 1 and 
otherwise. Similarly for cj g . The complete (or classification) log-likelihood associated with 
the LBM CQ) is 

n K m G n m K G 

£(r, c, w, p, e) = r ^ log w fc + c J9 l °sp 9 + EEE r ^ c ^ ^gM^IM- ( 2 ) 

i=l fc=l j = l g=l i=l j=l k=l g=l 

The E step using this log-likelihood directly is intractable due to the dependence structure 
among the rows and columns. Govaert & Nadif (2008) suggest a variational approximation 
to the joint distribution of the latent r, c which leaves r and c independent. Then using the 
interpretation of EM due to Neal & Hinton (1998) this leads to a new "fuzzy" criterion for 
block clustering 

n K m G 

£?(s,t,w, p, e) = £(s,t,£j,p,e) -EE s ^ lo g^ -EE^ 1 ^^ 

i=l k=l j=l g=l 
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which can be alternately maximized with respect to s, t and uj, p, Q where Sik = Pr(zj = k) 
and tjg = Pr (wj = g). 

The possible ways in which this criterion may be alternately maximized determines dif- 
ferent algorithms. The BEM2 algorithm maximizes it as follows. 

1. Initialize the unknowns s, t, u>, p, G at some sensible value. 

2. Maximize Q with respect to s, uj and G keeping t and p fixed. 

3. Maximize Q with respect to t, p and G keeping s and uj fixed. 

4. Iterate steps 2-3 until convergence. 

It is noted that each sweep of BEM2 has two maximizations of G. This maximization 
procedure is reported to have outperformed the other schemes considered in Govaert & 
Nadif (2008), so we use it here to compare with our Bayesian approach. 

2.1.2 Choosing K and G when using BEM2 

In dTJ, it is assumed that K and G are known. The number of clusters assumed can have 
a considerable effect on the output of clustering algorithms. Usually, many runs, each with 
a different number of clusters, are necessary. These are then compared to find the best 
clustering, either based on some information criterion or visual inspection of plots. 

Since the LBM is defined in terms of the latent allocation vectors z and w, it is not clear 
how one could use a standard information criterion (e.g. BIC (Schwarz 1978)) here to choose 
the number of components best supported by the data. Evaluation of the log density of Y 
at the MLE requires a sum over all K n G m terms as in (JTJ), which even for the moderate case 
of K = 2, G = 2, n = 10, m = 10 would require the sum of roughly one million log-likelihood 
evaluations. An alternative may be to use the maximized complete log-likelihood treating 
the row and column allocations as unknown parameters, C(r, c, u>, p, G) . van Dijk et al. 
(2009) have used this approach for LBMs when using AIC-3 (Bozdogan 1994) to choose K 
and G. In this case the number of parameters to be estimated is n(K — 1) + m(G — 1) + 
dKG + (K — 1) + (G — 1) where d is the dimension of any 9k g - A separate model estimation 
is required for each K and G combination over a grid of plausible models. Adopting this 
type of approach crucially involves replacing the maximum log likelihood with a maximized 
complete log likelihood and also the issue of selection of a particular information criterion, 
and could therefore be criticized for both reasons. 

The Bayesian LBM we propose seeks to incorporate uncertainty in K and G into the 
model. This is so that the clustering task is also one of cluster model determination. The 
model determination task and the allocation task are dealt with simultaneously through a 
fully Bayesian approach. This has analogy with some other block clustering strategies, which 
undertake greedy searches to find new row and column clusters. See for example Hartigan 
(2000). An advantage here is that the search has a probabilistic justification based on a 
posterior distribution for K and G. In the next section we introduce the Bayesian LBM 
which is at the core of the clustering procedures we discuss. 
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2.2 Bayesian latent block models 

The Bayesian LBM is formed by taking prior densities on K, G, O, a> and p. Let n(-) 
denote prior and posterior densities. Then we may write down the posterior of the number 
of clusters and latent cluster allocations from Bayes' theorem 

ir(K,G,z,w,u,p,e\Y) oc p(z\K, uj)p(w\G, p)p(Y\K, G, z, w, 6) 

xti(Q\K, G)tx(uj\K)tx(p\G)ti(K)ti(G). (3) 

Adopting a conjugate prior for u;, p and each 9 kg allows one to integrate these from the 
posterior analytically. We call this collapsing. Doing this allows us to obtain the marginal 
posterior tt(K,G,z,w\Y). Samples can be generated from this posterior using the MCMC 
sampler of Section [3j This is similar to the general approach of Nobile & Fearnside (2007). 
The idea of collapsing has been used by Sheng et al. (2003) in the analysis of a discretized 
microarray and by Griffiths & Steyvers (2004) in latent Dirichlet analysis for document 
classification. It would be possible to estimate this model without integrating out parameters 
by using reversible jump MCMC (RJMCMC) (Green 1995). We discuss this further in 
Section 13.21 

We choose a standard conjugate prior for each of the parameters to be integrated out. 
For example, uj ~ Dirichlet(a, . . . , a) and p ~ Dirichlet(/3, . . . , (3) a priori. For the examples 
considered in this paper, we take the non-informative values a = 1, /3 = 1. The prior on 9 kg 
will depend on the distribution assumed for the data. For the most widely used models, a 
standard conjugate prior will be available. The 9k g are assumed independent a priori. This 
leads to the posterior 

' z ' w|F) K «w<G) v{a]KV{n+aK} — vm c V{m+m n n m*. 

(4) 

where 

M kg = J-K{9 kg ) Yl II P(yij\°k g ) d9 kg . 

i-.Zi=k j:Wj= g 

where n k is the number of rows in cluster k and m g is the number of columns in cluster g. 
An outline of the calculation of the posterior is given in Appendix A. 

The priors for the number of clusters, tc(K) and tt(G) are taken to be truncated Poisson(l) 
over the ranges 1, . . . , K mSuX and 1, . . . , G max . Examples of Poisson priors being adopted for 
the number of components include Phillips & Smith (1996) and Stephens (2000). The use 
of a truncated Poisson(l) prior has been justified in Nobile (2005). We did experiment with 
a uniform prior on the number of clusters. However, we found that this gave unnecessary 
empty clusters in some situations. 

We now give the M kg for two useful data models which we will use in examples later. 

2.2.1 Bernoulli model for binary data 

Assume that Pr(?/y = l\zi — k, Wj — g) — 9 kg . We take a Beta(7, S) prior on 9 kg . Then 

T{7 + 5} r {s kg + 7} r {n k m g - s kg + 5} 



M, 



kg 



r{ 7 }r{5} Y{n k m g + 1 + 5} 
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where the block sufficient statistic s kg = Uij- Further detail on the calculation is 

i-.Zi=k j:wj=g 

given in Appendix B. 



2.2.2 Gaussian model for continuous data 

Assume Uij\zi = k,Wj = g ~ N(fi kg , o\ g ). Take the priors fi kg ~ N(£,t 2 o| s ) and o\ g ~ 
IG(£/2, 7/2) where IG(a, b) is the Inverse- Gamma distribution: p(x) = f^y ;r ~ ( ' a+1 ' ) exp{— b/x}. 
Then 



7r n fc m fl /2 r { ( j/2} (n fc m 9 r 2 + l) 1 




where Sfe s = X) J] yjj and ss kg = J2 yfj- Further details on calculating M kg are 

i-.Zi=k j:Wj=g i:Zi=k j:utj=g 

given in Appendix B. 

3 MCMC sampling of clusterings 

The sampler which we propose consists of four different moves. The first is just a standard 
Gibbs update for the row/column label. The second proposes to reallocate collections of 
rows and columns. The final two moves propose to add or remove clusters. We describe the 
moves for rows, but they apply to columns analogously. When running the algorithm, the 
moves are each applied to the rows and columns in a single sweep. Since the LBM will be 
invariant to cluster labellings, we will encounter the label switching problem. We outline 
how to deal with this as well as discussing how to summarize the output from the sampler. 

3.1 MCMC moves 

3.1.1 Gibbs sampling to update the allocation of one row 

Suppose row i is currently in cluster k. We then sample its new allocation, Zi from the 
distribution 



P (g i = A/|y,^g,z- i ,w)(x nk>Jr ™ n ^?y >«* k ^ 

n k -l + a M k , g M kg 



and p(zj = k\Y,K,G,Z-i,w) oc 1 where and are obtained respectively by 

removing row % from cluster k and adding it to cluster k' within column cluster g. The total 
computational effort required for the Gibbs sweep on rows and columns is 0((n + m)KG) 
which may be prohibitive for large K,G,n or m. It is possible to move one row and column 
between clusters using a Metropolis-Hastings move. This could be alternated with a Gibbs 
update to reduce computational overhead or some mixture of the two moves could be used. 
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3.1.2 Move to update the allocation of more than one row 



This move is similar to move M3 in Nobile & Fearnside (2007). Its role is to move more 
than one row at a time. The way in which new row allocations are proposed should isolate 
clusters more quickly than just performing one row Gibbs updates. The procedure is as 
follows. Choose two row clusters k and kl at random. Let S be the index set of rows currently 
belonging to clusters k and k! . The members of 5* are randomly reordered. Imagining clusters 
k and kl to be empty initially and S to be full, we sequentially take each row from S and 
allocate it to k or kl . This allocation is done using the probability that the current clusters 
k or kl generated that row conditioning on rows that have already been reallocated to k or 
kl . For row % in S these probabilities are denoted by p^ k and p k ) with p k + p k ) = 1- To 
write down the proposal probability of this move we use M kg , g = 1, . . . , G to represent the 
integrated likelihood of the members placed in cluster k before member i has been processed. 
Similarly n k represents the number of rows in cluster k before i has been processed. Then 
using similar notation to the Gibbs move it can be shown (see Appendix A. 2, Nobile & 
Fearnside (2007)) that 

pf n k + a^\M k , g Mif- 

Using + p k ) = 1, the above can be solved for p k ). The proposed allocation of row i, Zi 
may then be sampled. Once the quantities n k >,n k , M k > g and M kg have been updated based 
on Zi, the next row in 5* can be dealt with. 

When all members of S have been processed the proposal probability of moving from z 
to z is 

For the reverse move the proposal probability is 

— - — TTp w . 

The new allocation z is then accepted with probability min(l, A) where 

A = T{h k + a}T{n k , + a} ° M k , g M kg ^ „ p^_ 
T{n k + a}T{n k , + a} ^ My g M kg X ^ pf ' 

and h k ,n k i, M k t g , M kg are the proposed cluster sizes and integrated block likelihoods when 
all the members of S have been processed. 



3.1.3 Moves to split or combine clusters 

To add a cluster we first randomly propose a cluster, k, to "split". The new cluster will 
be labelled K + 1 if the current number of clusters is K. In the same way as the move to 
reallocate more than one row (Section 13.1 .2p . the probability of a row proposed as being in 
cluster k or K + 1 is given by the conditional probability it was generated by that cluster, 
the rows being processed sequentially. Clearly the order in which rows are processed is 
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important. Thus for the split and combine moves we place an ordering on the members of 
cluster k, that is, the order in which the members are arranged in cluster k is important. 
As well as taking members out from cluster k and placing them in cluster K + 1, this is 
important when we place all members back into cluster k in the combine move. It is possible 
to propose a label swap of K + 1 with any other label selected at random (itself included), 
say k'. This then would split cluster k into clusters k and some {1, . . . , K + l}\k. 

Let S denote the index set of rows currently belonging to cluster k. We choose a split 
move with probability p§ . For the split move, the denominator in the proposal ratio will be 

! ' (z ^ 2) = ^ '*'7f(7fTT)i^4 , • 

where the second term accounts for selecting the cluster to split, and then the cluster to 
swap labels with, the third term accounts for the number of ways in which members may 
be arranged (processed), and the fourth term is the product of conditional probabilities (see 
Section [3X2]) . 

For the combine move, two clusters are selected at random, say k and k' from the K + 1 
available. Then all members of cluster k! are proposed to be placed back in k. Thus the 
numerator in the proposal probability for the split move is 



p(z -»z) = (1 -pg 



K(K+l)n k \ 



where the first term is the probability of proposing a combine move, the second accounts for 
the clusters selected, and the third accounts for the number of ways in which the members 
of cluster k may be arranged. 

The acceptance probability for the split move is then min(l,v4) where 

ir(K + 1) T{n + aK} T{a(K + 1)} Y{h k + a]Y{n k , + a} 
7i(K) T{n + a{K + 1)} T{a}F{aK} T{n k + a} 

*n%^xi^(rM>r 

g=l 1V1 kg Ps \j e s / 

and n k , n k >, M kg , M k / g give the proposed sizes and integrated block likelihoods of the proposed 
clusters. 

The acceptance probability for the combine move is min(l, A~ l ). These moves are similar 
to the "split and combine" moves discussed by Richardson & Green (1997). Our experiments 
showed that they gave satisfactory mixing and higher acceptance rates than proposing empty 
clusters. 



3.2 Form of reversible jump sampler 

As noted, the moves discussed in the previous section resemble moves used in the RJMCMC 
sampler of Richardson & Green (1997) and other RJ samplers for related classification prob- 
lems (for example Robert et al. (2000)). The difference with our sampler is that the space 
we sample from is of fixed dimension. This is due to collapsing. Performing an equivalent RJ 
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analysis to that presented here would be challenging for LBMs. This would mean extending 
the Gibbs sampler of van Dijk et al. (2009) to include variable dimension moves for splitting 
or combining clusters. The construction of proposal densities for variable dimensional moves 
in RJ samplers can be crucial to their performance. Work has been done in this area (Brooks 
et al. 2003, Green 2003), but for many applications construction of proposals is case spe- 
cific. The reason for entertaining a RJ analysis here is that we are not only concerned with 
finding a cluster allocation for a specified LBM. We are also interested in exploring different 
cluster models, and so the task also becomes one of model determination as discussed in 
Section 12X21 

Consider splitting a row cluster k into k and k! in a typical RJ approach. This is more 
difficult than the component splitting case in Richardson & Green (1997), since splitting 
each row cluster gives rise to d(G + 1) new parameters where d is the dimension of any 9k g - 
Finding a proposal that will mix well may require lots of trial and error, especially if d > 1 
or G is even moderately large. Moreover, computational time would increase dramatically 
with respect to the collapsed LBM in these situations. 

Using a collapsed model, is, in a sense, a form of variance reduction for this model. We 
reduce variability in sampling of allocations, by integrating out oj, p and G. This should 
give better sampling of the high probability clusterings of the data, since uncertainty due to 
parameter values has vanished. 

3.3 Label switching 

The joint posterior of cluster models and allocations or labels (j4]) is invariant to label switch- 
ing, that is, the labels are not identifiable. If there is one labelling 1,...,K of the rows, 
then any permutation of this, say, 0"(1), . . . , cr(K), gives exactly the same information about 
clustering relationships. The posterior on row labels has K\ indistinguishable modes. Gen- 
erally as the Markov chain progresses, we will observe switches between these equivalent 
modes; the well known label switching phenomenon. In our case label switching can occur 
for row and column labels independently. There are many approaches for dealing with the 
label switching problem (Stephens 2000, Celeux et al. 2000). The approach we adopt here is 
due to Nobile & Fearnside (2007). It is ideal for our purposes since it does not involve loss 
functions based on sampled model parameters (which are no longer in our model). It just 
requires the samples of z and w. 

We now outline the procedure we use to deal with label switching. Some more details are 
given in Appendix C. We post-process row and column allocation vectors separately. The 
re-labelled data can then be used to compute posterior probability of cluster memberships 
and other quantities of interest. To post process the label vectors zi,z 2 , ... output from 
MCMC we begin by arranging these in order of increasing number of non-empty components. 
This gives the ordering z^,z' 2 ', . . ., where for s < t, z^ uses either the same number of 
components as in total, or less. For example, with K = 4, z^ = (3, 3, 2, 2, 2, 1) would 
come before z® = (4,4,3,3,1,2). Suppose we have processed and re-labelled the vectors 
z^ up to time T — 1, and there are Kt-i non-empty components in z^ T_1 ). Compute a cost 
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matrix with general element 

T-l n 

C(kuh) = £ £ i {z? + h, zP = k 2 } . 

t=l i=l 

Then the more disagrees with the vectors already processed, the higher this cost will 
be (see Appendix C). The square assignment algorithm of Carpaneto & Toth (1980) returns 
the permutation cr(-) of the labels in which minimizes the total cost X^f 1 C(k, cr(k)). 
We then relabel by permuting the labels according to &(■). 

3.4 Summarizing MCMC output 

Having sampled both the number of clusters and cluster memberships, it will be of interest to 
give a summary of the sampling. As different (K, G) cluster models are structurally different, 
it is not possible to give an "average" of cluster membership. We suggest two summaries. 

3.4.1 Using the modal cluster model 

The first summary focuses on using the modal, or most visited model from the MCMC 
output. It takes the series of (K, G) visited models and chooses the pair which appear most 
often. Call this pair (K , G) . Suppose this pair has occured N times in the post burn- in 
sample. We extract the iV pairs of label vectors z and w corresponding to these occurrences. 
We then post process these label vectors using the procedure to undo label switching outlined 
in the previous section. This will be necessary to compute posterior distributions of row and 
column cluster membership. After computing the posterior distributions of row membership, 
row i has distribution (qn, . . . , q^) where q^ is the estimated posterior probability row i 
belongs to cluster k in the (K, G) cluster model. For the summary we assign i to cluster 
argmaxfc q ik . The columns are given the same treatment. 

3.4.2 Using the MAP 

Since we are sampling from the fixed dimensional posterior tt(K, G, z, w|Y), the maximum 
a posteriori (MAP) cluster model and cluster membership (K, G, z, w) MAP is also a useful 
summary of the MCMC output. The MAP gives the visited (K, G, z, w) having highest 
probability a posteriori from the samples obtained. 

4 Simulation experiment 

To see how the sampler discriminates between different cluster models, it was run on some 
simulated data. We generated three 200 x 200 binary matrices with 4 and 4, 2 and 5 and 1 
and 4 row and column clusters respectively. In each case, the block parameter 8k g was drawn 
uniformly from [0, 1]. The blocks were then generated using Bernoulli(#fc 9 ) random variates. 
This is shown in the left of Figure [TJ Clusters were made less distinguishable by transforming 
the generated 6k g to the intervals [0.2,0.8] and [0.3,0.7] using 6 k a ^ = a + 6 kg (b — a) and 
generating two further matrices. The rows and columns of the resulting matrices were then 
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Figure 1: Simulated data with decreasing distinguishability 
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(K, G) 


& kg 


PMP 


f 


(4,4) 


A 


0.9550 


8.79 




B 


0.9463 


10.57 




C 


0.9014 


17.43 


(2,5) 


A 


0.9343 


4.55 




B 


0.8886 


9.79 




C 


0.8369 


13.66 


(1,4) 


A 


0.8035 


7.86 




B 


0.3000 


8.97 




C 


0.1494 


4.61 



Table 1: Results of simulation experiment. The PMP gives the posterior model probability 
of the generating model, f is the estimated IAT. 

randomly reordered, disguising the data structure. The chain was run for 1000 burn-in 
iterations and a further 16,000 iterations on each data set. We assumed a Beta(l, 1) prior 
for 9kg in all cases. The priors for u) and p are as in Section 12.21 

We looked at two performance diagnostics of the sampler. The first was the posterior 
model probability (PMP) of the model used to generate the data and the second was the 
integrated autocorrelation time (IAT) of sampled cluster models. Computing the PMP just 
amounts to counting the number of times the model in question was visited and dividing 
by the total number of samples. For the IAT we identify cluster models (K, G) by a model 
index R — 1, . . . , K max G max . Then we estimate the quantity r = 1 + Pr(^), where 

/?#(£) is the autocorrelation of the series of post burn-in samples Ri,R 2 , ... at lag t. The 
series here refers to the cluster models sampled from the posterior (T4|). Lower values of the 
IAT indicate better mixing and better performance of the MCMC sampler. The IAT can 
thus be used as a measure of efficiency for MCMC algorithms. See for example Liu (2004), 
Chapter 5 and Roberts (1996). 

The results are shown in Tabled! The 9k g column is coded A for 9k g ~ Uniform[0, 1], B 
for transformation to [0.2,0.8] and C for [0.3,0.7]. As the noise in the data increases, the 
ability to identify the model which generated the data decreases. This is to be expected. The 
estimated IAT indicates that we get less efficient sampling as the noise increases, with the 
exception of the (1,4) cluster model. This is a particularly challenging situation, since two 
of the clusters are very similar (see Figured]). The transformed 9u,9 i3 were for [0.2,0.8] : 
0.288,0.320 and for [0.3,0.7] : 0.36,0.38. The fact that these two clusters are practically 
indistinguishable would make the sampler choose a (1,3) model as the best model after 
scrambling of the data. In fact the most visited model in both these cases had 1 row cluster 
and 3 column clusters (62.37% and 82.66% of the posterior probability). In this situation, 
the best cluster model was not the same as the generating model. This is an artifact of the 
simulation process but shows that a sensible clustering can be achieved. 
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5 Congressional voting in US senate 



We apply the sampler to the UCI Congressional Voting data assuming the Bernoulli model 
of Section 12.2.11 The data records whether 435 members of the 98 th congress (267 democrats, 
168 republicans) voted "y a y'\ " na y"> abstained or were absent in votes on 16 different key 
issues. Here the members of congress are represented by rows, and the issues are represented 
by columns. The data is available from 

http : //archive . ics .uci . edu/ml/datasets/Congressional+Voting+Records 
and is shown in the left panel of Figure |2j The aim was to see whether the sampler could 
discover any clustering by party and issue. For example, one may expect that democrats 
voted differently to republicans on certain issues. It was thought best to ignore absent and 
abstain votes. Here, this is equivalent to treating these votes as a "nay", since our focus is 
on clustering rows and columns. The only sample sizes entering into our calculations are the 
number of rows and columns in each cluster. For the Bernoulli model, the block sufficient 
statistic is the sum of the data. This is not affected by a missing data point. We do point out 
however that missing data could be easily imputed by inclusion of a Gibbs step to sample 
from the full conditional of any missing point. 

The sampler was run for 110,000 iterations with 10,000 as a burn-in initialized at the no 
cluster model. To reduce correlation in samples, we took every 10 sample after burn-in. 
The move of Section 13.1.21 had a 16% acceptance rate for rows and 51% for columns. The 
cluster split and combine moves had about a 1.5% acceptance rate for rows and 8% for 
columns. The run took just over an hour on a 2.5GHz processor. 

Table [2] shows the distribution of the number of row and column clusters. It can be seen 
that about 60% of the posterior probability is placed on 6/7 row clusters and 12/13 column 
clusters. We extracted the samples with 7 row clusters and 12 column clusters to construct 
an estimated clustering following Section 13.4.11 The estimated clustering is shown in the 
middle panel of Figure FJJ The red horizontal lines here divide the clusters of congressmen, 
and the blue vertical lines divide the issue clusters. When referencing Figure [2] we say that 
the congressman clusters (rows) are numbered 1 to 7 top-bottom, and the issue clusters 
(columns) are numbered 1 to 12 left-right. 

Issues have only three non-singleton clusters. The first contains "anti-satellite-test- 
ban", "aid-to-nicaraguan-contras" and "mx-missile" (column cluster 1). The second con- 
tains "physician-fee-freeze" and "education-spending" (column cluster 3) and the third has 
"handicapped-infants" and "duty-free-exports" (column cluster 6). Row cluster composi- 
tion by party is shown in Table P3J The majority democrat party roughly splits into four 
clusters, while the republican party splits into two. The main discrepancy between the two 
large democrat clusters, 2 and 3, appear to be the issues "religious-groups-in-schools" and 
"crime" in issue clusters 9 and 12. Row cluster 6 which is also mainly democrat appears 
to vote similarly to the republican cluster 1. Row clusters 4 (democrat) and 5 (republican) 
appear to deviate from their core party vote. 

We compared the results obtained from our algorithm with those obtained from the 
BEM2 algorithm of Govaert & Nadif (2008), reviewed in Section P2.1.1I The algorithm was 
run using 7 row clusters and 12 column clusters. It should be noted that BEM2 requires 
the number of row and column clusters to be assumed known in advance. To obtain an 
estimated clustering, we took the cluster with the maximum probability of membership. 
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Voting data 



collapsed LBM 



BEM2 




Figure 2: Voting data. Colour key: white = "nay", black = "yay". Left panel: Raw data. 
Right panels: summary cluster membership from the modal 7 row and 12 column cluster 
model and the cluster membership obtained from BEM2. Row clusters are numbered 1- 
7, top-bottom. Column clusters are numbered left-right. The red lines divide clusters of 
congressmen, and the blue lines divide the issue clusters. 
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Table 2: Distribution of cluster models for voting data 
Cluster Democrat Republican 



1 (131) 8 123 

2 (125) 125 

3 (77) 71 6 

4 (38) 37 1 

5 (36) 3 33 

6 (23) 21 2 

7 (5) 2 3 



Table 3: Party distribution over row clusters from collapsed sampling. 

The composition by party of the row clusters from BEM2 is shown in Table HI Row cluster 1 
is similar in both, but there are some differences in the other clusters. The BEM2 clustering 
only used 10 column clusters of the 12 available. The right panel of Figure [2] shows the 
clustering from BEM2. For comparison purposes with the collapsed LBM clustering, the 
columns have been arranged in the same order. The collapsed LBM appears to identify 
more small clusters, leading to a marginally more homogeneous blocking of the data. 

Cluster Democrat Republican 



1 (131) 8 123 

2 (104) 104 

3 (62) 61 1 

4 (60) 50 10 

5 (35) 5 30 

6 (30) 26 4 

7 (13) 13 



Table 4: Party distribution over row clusters from BEM2 algorithm. 

There is an advantage here over an EM approach to fitting a cluster model in that the 
number of clusters need not be assumed known in advance. Here, model uncertainty is 
naturally inbuilt into the approach and it is dealt with automatically by the sampler. There 
is no user intervention to choose the cluster model. User intervention is only in the choice of 
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prior hyperparameters and priors on the number of clusters. In our experience, standard non 
informative priors for the data model parameters, and the truncated Poisson(l) prior on the 
number of clusters (as argued in Nobile (2005)), seem to perform well. The computations for 
the collapsed LBM are also numerically stable if clusters empty out. In our experimentation 
with EM algorithms this caused instability. Empty clusters could easily occur, say, if the 
chosen cluster model is not well supported by the data. 

6 Microarray experiments 

A DNA microarray experiment records expression levels of a large number of genes over a 
number of conditions or samples. The number of conditions or samples is usually less than 
100, while the number of genes could be in the thousands. Discovering which genes behave 
similarly and under which subgroups of conditions is the aim of analysis. One way to do 
this is to group together genes with similar expression levels. Methods differ in whether they 
allow clusters to overlap or not. Here we will not allow clusters to overlap due to the form 
of the LBM. 

Analyzing DNA experiments can be challenging, due to the large row dimension and 
the general uncertainty in how many clusters may be present in the data. We apply our 
sampler to data from DNA experiments on the budding yeast Saccharomyces Cerevisiae. The 
microarray contains 419 genes and records the expression level of these under 70 conditions. 
It was obtained from the R package biclust (Kaiser et al. 2009). Expression levels lay 
between —6 and +7. The aim is to see how much structure the sampler can uncover, so 
the rows and columns of the microarray were randomly reordered (Figure H] (a)). In our 
application the rows represent the genes and the columns represent the conditions. 

We use the Gaussian model of Section 12.2.21 for expression level. This model requires 
specification of four hyperparameters. Two of these (7 and 5) are for the prior on the 
block error variances and two are for the prior on the block means (£ and r 2 ). We choose 

7 = 5 = 0.02 and £ = 0, r 2 = 100. This choice of 7 and 5 gives a proper density on the error 
variance which is non-informative (see for example Spiegelhalter et al. (1996)). Similarly, 
choosing £ = is a reasonable non-informative choice given the range of the data. Setting 
t 2 = 100 says that the prior information on a block mean is equal to 1% of the information 
in the observed expression level of one gene under one condition within that block. This is 
also non-informative. 

The sampler was run for 220,000 iterations with 20,000 taken as burn-in. We stored every 
20^ iteration thereafter. The run was time consuming, taking approximately 3 hours. This 
said, the large gene dimension of such an array does pose a challenge when searching for two 
way clusters. The initial cluster model assumed had 1 row and column cluster i.e. no cluster 
structure. Acceptance rates for the move of Section 13.1.21 were 25% for rows and 18% for 
columns. Split and combine acceptances were about 0.5% for rows and about 25% each for 
columns. The low acceptance rates of split and combine moves for rows would be expected 
since finding clusters will be more difficult in a larger dimension. 

Table gives the PMP of the visited models from the MCMC output. The model space 
visited by the sampler is large. The modal model (25 row clusters and 4 column clusters) 
gave a posterior probability of 11.98%. There is posterior support for anything from 3 to 
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(a) (b) 



Figure 3: Yeast data, (a) Original microarray (b) MAP clustering from sampler 

5 column clusters, for 23 to 26 row clusters (PMP > 0.02 in all cases). This is an example 
with considerable model uncertainty and it may be difficult to know the models to include 
using an information criterion over a grid of possible models as discussed in Section 12.1.21 
and adopted by van Dijk et al. (2009). Our approach has the obvious advantage of exploring 
the uncertainty in the posterior model space and attaching a probability to each model. 

Instead of constructing a summary based on the modal model, we took the MAP clus- 
tering here (shown in Figure H] (b)). The MAP had 26 row clusters and 4 column clusters. 
To have a closer look at the row clusters, we plot a selection of these in Figure HI The plots 
show the gene expression profiles for genes in the same row cluster over conditions arranged 
by condition cluster. It can be seen that in certain cases, there is a clear clustering of genes 
with similar profiles. The gene clusters shown are arranged by size (left-right, top-bottom). 
Some of the larger clusters appear quite noisy, while some follow a common trend closely. 
Auxiliary runs of the sampler on the subsets of row clusters could be performed to try and 
isolate further cluster structures. 

7 Conclusion 

We have considered a collapsed Bayesian extension of the Latent Block Model of Govaert 

6 Nadif (2008). We showed how an MCMC sampler could be used to sample both the 
cluster model and the cluster memberships when clustering a data matrix into blocks. The 
approach was demonstrated on simulated data and two real data examples. The application 
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Table 5: Posterior distribution of cluster models for the microarray data. 




Figure 4: Selection of row clusters from the MAP clustering. Each plot corresponds to 
a different row cluster. Each profile (black line) gives the gene expression level for over 
all conditions. Conditions are arranged by condition cluster membership. The condition 
clusters are separated by the red dashed line. 
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to simulated data suggested that the sampler's performance deteriorates as clusters become 
less distinguishable. We applied the sampler to Congressional voting records from the U.S. 
senate. It was shown to perform well in isolating clusters of congressmen and "yay" , "nay" 
votes in the data. In the second real data example, we used the sampler for analysis of a 
DNA microarray experiment. This demonstrated that there can be considerable uncertainty 
in the number of clusters in certain situations. Knowing even the range of possible models 
may be difficult. Results from the microarray experiment demonstrated that clusters could 
be found by a search strategy with a probabilistic basis using the collapsed LBM sampler. 
Overall, the approach seems to be a robust way to block cluster a data matrix. The user need 
only specify prior hyperparameters and priors on the number of clusters. Code implementing 
the sampler written in the C language is available at www.ucd.ie/statdept/jwyse. This 
code can be easily modified to experiment with different priors on the number of clusters. 
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Appendix 

Appendix A: Calculation of the posterior 

Writing out all posterior terms longhand, assuming u) ~ Dirichlet(o;, . . . , a) and p ~ 
Dirichlet(/3, . . . , j3) a priori gives 

n(K,G,z,w,u,p,e\Y) oc it{K)it{G) ^f] V{ ^] ft off ft ,C ft ft II II PivM 

1 1«J L \Pj k= i 9= l k=lg=li . z . =k j. w . =g 

x ft ft ft -r 1 ft pr 

k=lg=l k=l 9=1 

- ^ K M^ r{a}Kr{/3} G 1W iiP 9 

xftiHM n n p(yM 

k=lg=l i:Zi=kj:Wj=g 

Integrating the left and right hand sides of the above with respect to oj, p and O gives 

n(K, G, z, w| Y) oc n(KMG) v{a]Kv{n + aK} r{/3}Gr{m + Jq] g II k 9 

where 

M kg = jix(6 kg ) J| Yl P(yij\°kg)d9 kg 

i-.Zi=k j:wj=g 
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Appendix B: Calculation of Mk g 
Bernoulli model for binary data 

Assume that Pr(?/y = l\zi = k, Wj = g) = 9 kg . We take a Beta(7, 5) prior on 9 kg . Then 

<0k 9 ) n n p(vm = Sm Ci 1 -^" 1 n n o ^ 1 

i:Zi=k j:u>j=g XlS I J i:Zi=k j:Wj=g 

_ £{7 + 5} Sfc9+7 -l \n fc m 9 -s fc9 +5-l 

" r{ 7 }r{*p« ( fcflj 

Integrating the left and right hand side of this with respect to 9 kg gives 

T{7 + 5} T {s kg + 7} T {n fc m 9 - s kg + 5} 



r{ 7 }r{5} T{n k m g + 1 + 5} 



Gaussian model for continuous data 

Assume y%j\zi = k,Wj = g ~ N(fj, kg ,a kg ). Take the priors /i kg ~ N(£,T 2 a kg ) and cr 2 ^ ~ 
IG(5/2, 7/2) where IG(a, b) is the Inverse-Gamma distribution: p(x) = y^x^^ 1 ^ exp{— b/x}. 
Then 



(7/2) 



5/2 



i-.Zi=k j:wj=g 



T{5/2} 



x(2 7 rr 2 ^)- 1 / 2 exp{-(^ - OV 2 ^} 

x (27i(r 2 kg )- nkm ^ 2 exp {-(ss fc9 - 2/j kg s kg + n k m g ii 2 kg )/2a 2 kg } 

Completing the square on /i kg and integrating with respect to it gives 

(O \-n k m a /2 (t/ 2 )^ 2 -2((n k m a +8)/2+l) 

{ } T{5/2} ° 

x { n kmg r 2 + I)-/ 2 exp {- ^ (ss kg - + £ + ^) } ' 

Finally, integrating with respect to a 2 ^ and tidying up gives 

M ^ 2 T{(n k m g + 5)/2} ( r 2 (s kg + £/r 2 ) 2 f V^ 8 ^ 

Mfc * ^ mg/2r{V2} (nfemgT 2 + 1} V2 nfem9T 2 + i + r 2 + ^ 

Appendix C: Cost matrix for undoing label swiching 

The cost matrix for processing the vector is 

T-l n 

*=i i=i 
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The more disagrees with the vectors already processed, the higher this cost will be. This 
is made clearer by rewriting the general entry of the cost matrix: 

T-l n 

t=i i=i 

T-l n 

= n(T-l)-Y,Y,l{4 t] = h,zP = k 2 } 

t=l i=l 
n 

= n(T-l)-X;^(T-l,fci)I{^ T) = k 2 } (6) 
1=1 

where Ni(T — l,k{) gives the number of processed samples up to z^ T ~^ which have given 
label k\ to row i. For the sake of discussion, consider processing a sample where no label 
switching has occurred, K is fixed and there are no changes in labels from one MCMC sample 
to the next. In this case the costs will be 

n 

C(k,k) = n(T- 1) - £ Ni(T- l,k)I{zP = k} 

i=i 

= n(T-l)-n k (T-l) 
= (n-n k )(T-l) 

and for k! ^ k 

n 

C(k,k') = n{T-l)-J2Ni(T-l,k)l{zP = k'} 

i=l 

= n{T -l)-0 
= n(T-l). 

A cost of could only be obtained when all rows have the same label, that is, when there 
is no clustering. Of course this discussion simplifies the problem somewhat. The key is in 
finding a permutation of the labels to minimize all costs. This permutation is found by 
solving the square assignment problem using the algorithm of Carpaneto & Toth (1980) 
in our case. Finally, we note that ([6]) can be exploited to give an online post processing 
procedure. Define the K x n matrix S^ T ~^ with general entry ^ = X^=] 1 I{zf = k}. 
Then we have C(k,k') = n(T — 1) — J27=i Sm^H^^ = ^0- After calling the square 
assignment algorithm and permuting the labels z^ T ^ according to its solution, we can update 
S,wngS2P = sg- l) +I{zP = k}. 
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